Exploratory Data Analysis
Learn how to answer basic questions about your data
Learn how to identify interesting relationships in your data
Use your new data science tools to better understand your data
Upload your scripts to GitHub
Track changes to your documents, code, or data over time
Work from one document
Have access to your work from anywhere
Create safe points in case something breaks or you want to experiment
Git: open-source version control software.
Think R.
GitHub: a website that allows you to store your Git repositories online and makes it easy to collaborate with others.
Think RStudio.
More reproducible, transparent research
Better version control
Easy collaboration with others
Four verbs you need to know to use Git for version control:
add
commit
push
pull
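As a rough sketch, the four verbs can be run from the R console by calling the git command-line tool. This assumes git is installed and on your PATH, and that the project is already a Git repository with a remote set up. The helper names `git()` and `save_and_share()` are made up for illustration; they are not part of any package.

```r
# Hypothetical helper: run a git command in the current working directory
# and return whatever git prints.
git <- function(...) {
  system2("git", args = c(...), stdout = TRUE, stderr = TRUE)
}

# Hypothetical helper: stage, record, and upload a set of files.
save_and_share <- function(files, message) {
  git("add", shQuote(files))              # add: stage the files you changed
  git("commit", "-m", shQuote(message))   # commit: record a safe point with a message
  git("push")                             # push: send your commits to the remote
}

# pull: bring down changes that collaborators have pushed.
# git("pull")
```

For example, `save_and_share("eda.R", "Add first EDA script")` would stage, commit, and push a script called `eda.R` (a placeholder file name).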
Three different options:
RStudio GUI
Shell/terminal
GitHub Desktop
A repository is like a folder for your project, but better!
Organises your work
Displays useful information, including a general description, navigation, changes
A great tool for project-oriented workflows
We already have R projects that we started yesterday.
We can sync the existing R project with our new repository.
The usethis R package is a brilliant helper package.
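One possible way to connect an existing R project to GitHub with usethis is sketched below. It assumes usethis is installed, that you run it from inside the project, and that a GitHub personal access token is already configured. The wrapper name `sync_project_with_github()` is made up for illustration, and the steps used in this camp may differ.

```r
# install.packages("usethis")  # run once if usethis is not installed

# Hypothetical wrapper so nothing runs until you call it yourself.
sync_project_with_github <- function() {
  usethis::use_git()     # turn the current project into a Git repository
  usethis::use_github()  # create a matching GitHub repository and push to it
}
```

Both functions are interactive and ask for confirmation before committing or pushing, so they are best run one at a time from the console.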
GitHub is like Google Docs for your code.
Create a new GitHub repository for this camp.
Sync your existing R project to this new repository.
add your scripts from yesterday and today.
Write a helpful commit message for your future self.
push your work up to GitHub.
Add me as a collaborator: @hgoers.
Exploratory data analysis is a critical step in your quantitative research process.
What type of variation occurs within my variables?
What type of covariation occurs between my variables?
gapminder
# A tibble: 6 × 6
  country     continent  year lifeExp      pop gdpPercap
  <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Afghanistan Asia       1962    32.0 10267083      853.
4 Afghanistan Asia       1967    34.0 11537966      836.
5 Afghanistan Asia       1972    36.1 13079460      740.
6 Afghanistan Asia       1977    38.4 14880372      786.
What are the earliest and latest years we cover?
# A tibble: 1 × 2
  `min(year)` `max(year)`
        <int>       <int>
1        1952        2007
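A summary like the one above can be produced along these lines. This is a sketch that assumes the dplyr and gapminder packages are installed; the object name `year_range` is only illustrative.

```r
library(dplyr)
library(gapminder)

# Collapse the whole data set to one row holding the first and last year.
year_range <- gapminder |>
  summarise(min(year), max(year))

year_range
```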
What about our other numeric variables?
The five-number summary is a useful way to summarise numeric data. It consists of the:
Minimum,
25th percentile,
50th percentile (the median),
75th percentile,
Maximum
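As a small illustration, base R's `quantile()` returns these five numbers. The vector below reuses only the six rounded lifeExp values printed earlier, so the result describes those six rows and not the full data set. The names `life_exp` and `five_num` are illustrative.

```r
# Six life expectancy values copied from the rows shown earlier (illustration only).
life_exp <- c(28.8, 30.3, 32.0, 34.0, 36.1, 38.4)

# Minimum, 25th, 50th (median), 75th percentile, and maximum.
five_num <- quantile(life_exp, probs = c(0, 0.25, 0.5, 0.75, 1))

five_num
```

On a full data frame, `summary()` reports the same five numbers plus the mean for every numeric column.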
Does one variable tend to move in the same direction as another?
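One quick, rough check is the correlation coefficient. The sketch below uses only the six rows printed earlier, so it illustrates the mechanics and says nothing about the full data set. The names `afghanistan` and `r` are illustrative.

```r
# The six rows shown earlier, typed in by hand (illustration only).
afghanistan <- data.frame(
  year    = c(1952, 1957, 1962, 1967, 1972, 1977),
  lifeExp = c(28.8, 30.3, 32.0, 34.0, 36.1, 38.4),
  pop     = c(8425333, 9240934, 10267083, 11537966, 13079460, 14880372)
)

# Pearson correlation: values near 1 mean the two variables rise together,
# values near -1 mean one falls as the other rises.
r <- cor(afghanistan$lifeExp, afghanistan$pop)

r
```

A scatter plot of the two variables is usually a better first look than a single number, because correlation only captures linear relationships.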
A quick look with glimpse():
A quick summary with skim():
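The two calls named above might look like this. This is a sketch that assumes the dplyr, skimr, and gapminder packages are installed.

```r
library(dplyr)
library(skimr)
library(gapminder)

# glimpse(): one line per column, showing its type and first few values.
glimpse(gapminder)

# skim(): summary statistics and missing-value counts, grouped by column type.
skim(gapminder)
```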
Today you:
Learnt how to explore and visualise interesting relations in your data
Used your new data science tools to better understand your data